Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collection ergonomics #990

Merged
merged 13 commits into from
Jul 15, 2024
Merged

Collection ergonomics #990

merged 13 commits into from
Jul 15, 2024

Conversation

Yoshanuikabundi
Copy link
Collaborator

@Yoshanuikabundi Yoshanuikabundi commented Jun 12, 2024

Description

Implements easier indexing into collections and dictionaries of potentials. Closes #968.

Works by defining equality between certain TopologyKey subclasses and tuples of atom indices when non-atom index attributes are None. Also implements __getitem__ on Collection to index directly from atom indices to potential.

>>> from openff.toolkit import ForceField, Molecule
>>> interchange = ForceField("openff-2.2.0.offxml").create_interchange(
...     Molecule.from_mapped_smiles("[H:1][O:2][H:3]").to_topology()
... )
>>> interchange['Bonds'].key_map[1, 2]
PotentialKey associated with handler 'Bonds' with id '[#8:1]-[#1:2]'
>>> interchange['Bonds'][1, 2]
Potential(parameters={'k': <Quantity(1076.70159, 'kilocalorie_per_mole / angstrom ** 2')>, 'length': <Quantity(0.975230101, 'angstrom')>}, map_key=None)
>>> 

I had to change the hash implementations to not include None in the hash if all non-atom indices attributes were None so that the hash matches between the TopologyKey and the appropriate tuple. This might occasionally cause collisions... but it should be extremely rare and regardless Python's requirement is that __eq__ implies hashes are equal, not the converse.

Atom indices for bonds have to be in the correct order. Fixing this for indexing with tuples would require replacing Collection.key_map with a custom class. We could make the ergonomics easier by guaranteeing that atom indices in bonds are always ordered a certain way?

Checklist

  • Add tests
  • Lint
  • Update docstrings

Copy link

codecov bot commented Jun 12, 2024

Codecov Report

Attention: Patch coverage is 96.66667% with 1 line in your changes missing coverage. Please review.

Project coverage is 91.28%. Comparing base (6a1fb9e) to head (ed12deb).

Additional details and impacted files

@mattwthompson
Copy link
Member

We could make the ergonomics easier by guaranteeing that atom indices in bonds are always ordered a certain way?

Seems worth it to me - but I'm not sure about other valence terms

@mattwthompson
Copy link
Member

Okay, I see you've gone about sorting the other terms. I think this breaks angles and dihedrals, though, since they're not symmetric like bonds?

@Yoshanuikabundi
Copy link
Collaborator Author

Yoshanuikabundi commented Jul 10, 2024

We could make the ergonomics easier by guaranteeing that atom indices in bonds are always ordered a certain way?

Seems worth it to me - but I'm not sure about other valence terms

We'd have to do it everywhere because this PR just indexes in directly. It might be better to just look up all equivalent index orderings?

Okay, I see you've gone about sorting the other terms.

Welp I didn't mean to, I'll add tests and double check

EDIT: Yeah could you point out where I've done this?

I think this breaks angles and dihedrals, though, since they're not symmetric like bonds?

Yeah I agree, you can reverse the order but I think that's the only symmetry op for both?

@Yoshanuikabundi
Copy link
Collaborator Author

OK I've just made it so that when indexing into a collection with a tuple, both the forward and reverse of the tuple will be looked up. I think this is a lot less intrusive than guaranteeing everywhere that the tuples are ordered. My understanding is that all valence terms can be indexed in either direction, but if this isn't the case then I can change it to only look up the reverse for 2-tuples.

@mattwthompson
Copy link
Member

could you point out where I've done this?

Nope, I think I got confused looking too superficially at which tests you've added

@mattwthompson
Copy link
Member

Am I meant to be able to look up other valence terms, or is this messed up by all of those extra fields (some of which might not be necessary)?

In [12]: interchange = ForceField("openff-2.2.0.offxml").create_interchange(Molecule.from_smiles("
    ...: CCO").to_topology())

In [13]: next(iter(interchange['ProperTorsions'].key_map))
Out[13]: ProperTorsionKey with atom indices (0, 1, 2, 8), mult 0

In [14]: interchange['ProperTorsions'].key_map[0, 1, 2, 8]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[14], line 1
----> 1 interchange['ProperTorsions'].key_map[0, 1, 2, 8]

KeyError: (0, 1, 2, 8)

@Yoshanuikabundi
Copy link
Collaborator Author

Yoshanuikabundi commented Jul 11, 2024

Yeah that's on purpose - since I implemented this by defining __eq__ on the TopologyKey subclasses, you have to include all the information in the tuple - otherwise you violate a == b implies hash(a) == hash(b) and run into issues if you had two entries with the same atom indices. I do this implicitly by only making tuples equal to a TopologyKey when the non-indices fields of the TopologyKey are the defaults.

I thought about allowing more information to be given in the tuple - for example, to index the above torsion you could do something like:

interchange['ProperTorsions'].key_map[(0, 1, 2, 8), 0]
# Or
interchange['ProperTorsions'].key_map[(0, 1, 2, 8), 0, None, None]

But since users would have to remember the correct ordering for everything and there's no way to disambiguate the name of the field you're specifying and all the fields have the same type I thought this provided a relatively small ergonomics boost. If users want to specify the extended fields, they can still index with a ProperTorsionsKey (or whatever), which provides that documentation (via autocomplete) and disambiguation (via named arguments). I chose the existing behavior as the best ratio of ergonomics improvement to amount of new code, but I made this judgement assuming the default values of all fields would cover the majority of use cases.

Here are some ways I've thought about dealing with this:

  1. Leave things as they are

  2. Implement equality between extended tuples as above (or some subset)

  3. If mult=0 and mult=None are equivalent, change the default value of mult (and the value that __eq__ and __hash__ assume) to 0 (for bonus points, also change the type from int | None to int)

  4. If mult=0 is more common than mult=None but a breaking change is not desired, change just the value that __eq__ and __hash__ assume to 0 (I think this would be more confusing than 3)

  5. Implement a more detailed lookup in Collections.__getitem__ and if the given key is unambiguous wrt the existing keys, use it - the downside of this is that since we don't know the correct value ahead of time (or even the correct type) we'd actually have to iterate over all the existing keys to find possible matches, which changes worst case lookup from O(1) to O(n).

Note also that indexing into collection is different to indexing into collection.key_map, as I didn't want to change key_map away from being a simple dict:

In [10]: next(iter(interchange['Bonds'].key_map))
Out[10]: BondKey with atom indices (0, 1)

In [11]: interchange['Bonds'][0, 1] # Returns the underlying potential
Out[11]: Potential(parameters={'k': <Quantity(430.602781, 'kilocalorie_per_mole / angstrom ** 2')>, 'length': <Quantity(1.5336276, 'angstrom')>}, map_key=None)

In [12]: interchange['Bonds'][1, 0] # Works with reversed keys
Out[12]: Potential(parameters={'k': <Quantity(430.602781, 'kilocalorie_per_mole / angstrom ** 2')>, 'length': <Quantity(1.5336276, 'angstrom')>}, map_key=None)

In [13]: interchange['Bonds'].key_map[0, 1] # Returns the PotentialKey
Out[13]: PotentialKey associated with handler 'Bonds' with id '[#6X4:1]-[#6X4:2]'

In [14]: interchange['Bonds'].key_map[1, 0] # Doesn't work with reversed keys
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[14], line 1
----> 1 interchange['Bonds'].key_map[1, 0]

KeyError: (1, 0)

@mattwthompson mattwthompson changed the base branch from main to develop July 11, 2024 18:45
assert BondKey(atom_indices=(1, 3), bond_order=None) != ((1, 3), None)
assert BondKey(atom_indices=(1, 3), bond_order=1.5) != (1, 3)
assert BondKey(atom_indices=(1, 3), bond_order=1.5) != ((1, 3), None)
assert BondKey(atom_indices=(1, 3), bond_order=1.5) != ((1, 3), 1.5)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure these two should be equal?

    def __hash__(self) -> int:
        if self.bond_order is None:
            return hash(tuple(self.atom_indices))
        return hash((tuple(self.atom_indices), self.bond_order))

Copy link
Collaborator Author

@Yoshanuikabundi Yoshanuikabundi Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this test is correct - Lines 131, 133 and 134 are all extended tuples, which has not been implemented (as we've discussed elsewhere), and in line 133 the bond order is not None so the tuple doesn't match.

The second return line in that hash function is the existing behaviour - BondKey.__eq__ is

def __eq__(self, other) -> bool:
    return super().__eq__(other) or (
        self.bond_order is None and other == self.atom_indices
    )

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh I see what's going on, TopologyKey.__eq__ was changed to compare hash functions. I don't think this is generally technically correct as you might get hash collisions - Python uses a 64 bit hash, so there are 2^64 possible hashes, and a typical simulation system might have 2^16 atoms (~65k), so if you need four of them to define a torsion then you might have 2^64 possible atom index tuples... which is really filling up the pigeon holes. Granted torsions tend to be defined between nearby atoms but that's also assuming that Python's hash function is doing a good job of avoiding collisions, and it's had pathological cases in the past.

A better way might be to have a _tuple() function that gives the equivalent tuple, and then define __hash__ as return hash(self._tuple()) and __eq__ as something like return self._tuple() == other._tuple() or self._tuple() == other

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparing hashes in __eq__ can also run into issues when comparing objects of different types.

Copy link
Member

@mattwthompson mattwthompson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm past providing useful feedback on this changelog, which is long past due its green check. I want to point this to the develop branch which targets an upcoming 0.4.x line; one test confuses me but otherwise please merge this when you're happy with it (% any confusion I've added).

@mattwthompson
Copy link
Member

I get why equality across extended tuples is tricky; given that the motivation of this behavior is making our lives easier, I'm fine leaving that door closed.

mult in general is ... not something I'm completely happy with but also something I'm not eager to change. BondKey basically representing a tuple with some extra (optional!) flavors feels right, but how periodic torsions are handled feels clunkier. Not to be dealt with here, but I think that's a prime candidate for a future refactor.

I looked through the docs to find a good place to show this off, but I didn't find anything. In the future that might be worth a subsection, either in the static docs or a notebook(s) somewhere. I expected there to be tons of these excessiely-difficult-lookup cases, but now I think those are just in tests and the source code. I don't see much of a reason to rewrite what's there, but with any luck I'll remember this trick and future code will be easier to write and a little bit more sightly to look at.

@Yoshanuikabundi
Copy link
Collaborator Author

Ok well I implemented extended tuples as a way to keep the TopologyKey.__eq__ implementation simple and correct while ensuring it's consistent with TopologyKey.__hash__. I've added an example of indexing to the Interchange API docs and also to the part of the user guide that describes the inner workings of Collection.

There are still a few topology key classes that are not subclasses of TopologyKey because they correspond to a single atom - I think the best way forward there would just be to make the single atom index field a property that points to self.atom_indices[0], but I haven't implemented this.

If you're happy with this change set, I'm happy for you to merge it at your convenience!

@mattwthompson mattwthompson merged commit e06176c into develop Jul 15, 2024
21 checks passed
@mattwthompson mattwthompson deleted the collection-ergonomics branch July 15, 2024 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Easier indexing into collections
2 participants